AITopics | answer space

Collaborating Authors

answer space

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Think First, Assign Next (ThiFAN-VQA): A Two-stage Chain-of-Thought Framework for Post-Disaster Damage Assessment

Karimi, Ehsan, Le, Nhut, Rahnemoonfar, Maryam

arXiv.org Artificial IntelligenceNov-26-2025

--Timely and accurate assessment of damages following natural disasters is essential for effective emergency response and recovery. Recent AI-based frameworks have been developed to analyze large volumes of aerial imagery collected by Unmanned Aerial V ehicles (UA Vs), providing actionable insights rapidly. However, creating and annotating data for training these models is costly and time-consuming, resulting in datasets that are limited in size and diversity. Furthermore, most existing approaches rely on traditional classification-based frameworks with fixed answer spaces, restricting their ability to provide new information without additional data collection or model retraining. Using pre-trained generative models built on in-context learning (ICL) allows for flexible and open-ended answer spaces. However, these models often generate hallucinated outputs or produce generic responses that lack domain-specific relevance. T o address these limitations, we propose Think First, Assign Next (ThiF AN-VQA), a two-stage reasoning-based framework for Visual Question Answering (VQA) in disaster scenarios. ThiF AN-VQA first generates structured reasoning traces using chain-of-thought (CoT) prompting and ICL to enable interpretable reasoning under limited supervision. A subsequent answer selection module evaluates the generated responses and assigns the most coherent and contextually accurate answer, effectively improve the model performance. Experiments on FloodNet and RescueNet-VQA, UA V-based datasets from flood-and hurricane-affected regions, demonstrate that ThiF AN-VQA achieves superior accuracy, interpretability, and adaptability for real-world post-disaster damage assessment tasks. N the immediate aftermath of natural disasters, first responders rely heavily on up-to-date information to assess damage, identify hazards, allocate resources, and reach survivors as quickly as possible.

assessment, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

2511.19557

Country: North America > United States (1.00)

Genre: Research Report (1.00)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
(3 more...)

Add feedback

TurnaboutLLM: A Deductive Reasoning Benchmark from Detective Games

Yuan, Yuan, He, Muyu, Shahid, Muhammad Adil, Huang, Jiani, Li, Ziyang, Zhang, Li

arXiv.org Artificial IntelligenceSep-23-2025

This paper introduces TurnaboutLLM, a novel framework and dataset for evaluating the deductive reasoning abilities of Large Language Models (LLMs) by leveraging the interactive gameplay of detective games Ace Attorney and Danganronpa. The framework tasks LLMs with identifying contradictions between testimonies and evidences within long narrative contexts, a challenging task due to the large answer space and diverse reasoning types presented by its questions. We evaluate twelve state-of-the-art LLMs on the dataset, hinting at limitations of popular strategies for enhancing deductive reasoning such as extensive thinking and Chain-of-Thought prompting. The results also suggest varying effects of context size, the number of reasoning step and answer space size on model performance. Overall, TurnaboutLLM presents a substantial challenge for LLMs' deductive reasoning abilities in complex, narrative-rich environments.

large language model, machine learning, natural language, (21 more...)

arXiv.org Artificial Intelligence

2505.15712

Country:

North America (0.28)
Asia (0.28)

Genre: Research Report (0.64)

Industry: Law > Litigation (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.48)

Add feedback

Why Does Your CoT Prompt (Not) Work? Theoretical Analysis of Prompt Space Complexity, its Interaction with Answer Space During CoT Reasoning with LLMs: A Recurrent Perspective

Zhang, Xiang, Cao, Juntai, Wei, Jiaqi, You, Chenyu, Ding, Dujian

arXiv.org Artificial IntelligenceMar-13-2025

Despite the remarkable successes of Large Language Models (LLMs), their fundamental Transformer architecture possesses inherent theoretical limitations that restrict their capability to handle reasoning tasks with increasing computational complexity. Chain-of-Thought (CoT) prompting has emerged as a practical solution, supported by several theoretical studies. However, current CoT-based methods (including ToT, GoT, etc.) generally adopt a "one-prompt-fits-all" strategy, using fixed templates (e.g., "think step by step") across diverse reasoning tasks. This method forces models to navigate an extremely complex prompt space to identify effective reasoning paths. The current prompt designing research are also heavily relying on trial-and-error rather than theoretically informed guidance. In this paper, we provide a rigorous theoretical analysis of the complexity and interplay between two crucial spaces: the prompt space (the space of potential prompt structures) and the answer space (the space of reasoning solutions generated by LLMs) in CoT reasoning. We demonstrate how reliance on a single universal prompt (e.g. think step by step) can negatively impact the theoretical computability of LLMs, illustrating that prompt complexity directly influences the structure and effectiveness of the navigation in answer space. Our analysis highlights that sometimes human supervision is critical for efficiently navigating the prompt space. We theoretically and empirically show that task-specific prompting significantly outperforms unsupervised prompt generation, emphasizing the necessity of thoughtful human guidance in CoT prompting.

final output, input list, template, (15 more...)

arXiv.org Artificial Intelligence

2503.10084

Country:

North America > United States > New York > Suffolk County > Stony Brook (0.04)
North America > Canada > British Columbia (0.04)
Asia > Singapore (0.04)
Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.04)

Genre: Research Report > New Finding (0.67)

Industry: Leisure & Entertainment (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.88)

Add feedback

Supervised Chain of Thought

Zhang, Xiang, Ding, Dujian

arXiv.org Artificial IntelligenceOct-18-2024

Large Language Models (LLMs) have revolutionized natural language processing and hold immense potential for advancing Artificial Intelligence. However, the core architecture of most mainstream LLMs--the Transformer--has inherent limitations in computational depth, rendering them theoretically incapable of solving many reasoning tasks that demand increasingly deep computations. Chain of Thought (CoT) prompting has emerged as a technique to address these architectural limitations, as evidenced by several theoretical studies. It offers a promising approach to solving complex reasoning tasks that were previously beyond the capabilities of these models. Despite its successes, CoT and its variants (such as Tree of Thought, Graph of Thought, etc.) rely on a "one-prompt-for-all" approach, using a single prompt structure (e.g., "think step by step") for a wide range of tasks--from counting and sorting to solving mathematical and algorithmic problems. This approach poses significant challenges for models to generate the correct reasoning steps, as the model must navigate through a vast prompt template space to find the appropriate template for each task. In this work, we build upon previous theoretical analyses of CoT to demonstrate how the one-promptfor-all approach can negatively affect the computability of LLMs. Our findings show that task-specific supervision is essential for navigating the prompt space accurately and achieving optimal performance. Through experiments with state-of-the-art LLMs, we reveal a gap in reasoning performance when supervision is applied versus when it is not. Our goal is to provide deeper insights into the mechanisms underlying CoT, offering guidance for the effective design of CoT variants. Additionally, we underscore the limitations of traditional "unsupervised" prompting methods, arguing that users of CoT cannot simply "sit back" and rely entirely on the model.

large language model, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2410.14198

Country:

North America > Canada > British Columbia (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Europe > Croatia > Dubrovnik-Neretva County > Dubrovnik (0.04)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)

Add feedback

LM-PUB-QUIZ: A Comprehensive Framework for Zero-Shot Evaluation of Relational Knowledge in Language Models

Ploner, Max, Wiland, Jacek, Pohl, Sebastian, Akbik, Alan

arXiv.org Artificial IntelligenceAug-28-2024

Knowledge probing evaluates the extent to which a language model (LM) has acquired relational knowledge during its pre-training phase. It provides a cost-effective means of comparing LMs of different sizes and training setups and is useful for monitoring knowledge gained or lost during continual learning (CL). In prior work, we presented an improved knowledge probe called BEAR (Wiland et al., 2024), which enables the comparison of LMs trained with different pre-training objectives (causal and masked LMs) and addresses issues of skewed distributions in previous probes to deliver a more unbiased reading of LM knowledge. With this paper, we present LM-PUB- QUIZ, a Python framework and leaderboard built around the BEAR probing mechanism that enables researchers and practitioners to apply it in their work. It provides options for standalone evaluation and direct integration into the widely-used training pipeline of the Hugging Face TRANSFORMERS library. Further, it provides a fine-grained analysis of different knowledge types to assist users in better understanding the knowledge in each evaluated LM. We publicly release LM-PUB-QUIZ as an open-source project.

large language model, machine learning, natural language, (16 more...)

arXiv.org Artificial Intelligence

2408.15729

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Antarctica (0.05)
Africa > Uganda > Central Region > Kampala (0.05)
(7 more...)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Add feedback

Progress or Regress? Self-Improvement Reversal in Post-training

Wu, Ting, Li, Xuefeng, Liu, Pengfei

arXiv.org Artificial IntelligenceJul-6-2024

Self-improvement through post-training methods such as iterative preference learning has been acclaimed for enhancing the problem-solving capabilities (e.g., mathematical reasoning) of Large Language Models (LLMs) without human intervention. However, as exploration deepens, it becomes crucial to assess whether these improvements genuinely signify progress in solving more challenging problems or if they could lead to unintended regressions. To address this, we propose a comprehensive evaluative framework that goes beyond the superficial pass@1 metric to scrutinize the underlying enhancements of post-training paradigms for self-improvement. Through rigorous experimentation and analysis across diverse problem-solving tasks, the empirical results point out the phenomenon of \emph{self-improvement reversal}, where models showing improved performance across benchmarks will paradoxically exhibit declines in broader, essential capabilities, like output diversity and out-of-distribution (OOD) generalization. These findings indicate that current self-improvement practices through post-training are inadequate for equipping models to tackle more complex problems. Furthermore, they underscore the necessity of our critical evaluation metrics in discerning the \emph{progress or regress} dichotomy for self-improving LLMs.

large language model, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2407.05013

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Asia > China > Shanghai > Shanghai (0.04)
North America > United States > California > San Diego County > San Diego (0.04)
(4 more...)

Genre: Research Report > New Finding (0.68)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.95)

Add feedback

BEAR: A Unified Framework for Evaluating Relational Knowledge in Causal and Masked Language Models

Wiland, Jacek, Ploner, Max, Akbik, Alan

arXiv.org Artificial IntelligenceApr-5-2024

Knowledge probing assesses to which degree a language model (LM) has successfully learned relational knowledge during pre-training. Probing is an inexpensive way to compare LMs of different sizes and training configurations. However, previous approaches rely on the objective function used in pre-training LMs and are thus applicable only to masked or causal LMs. As a result, comparing different types of LMs becomes impossible. To address this, we propose an approach that uses an LM's inherent ability to estimate the log-likelihood of any given textual statement. We carefully design an evaluation dataset of 7,731 instances (40,916 in a larger variant) from which we produce alternative statements for each relational fact, one of which is correct. We then evaluate whether an LM correctly assigns the highest log-likelihood to the correct statement. Our experimental evaluation of 22 common LMs shows that our proposed framework, BEAR, can effectively probe for knowledge across different LM types. We release the BEAR datasets and an open-source framework that implements the probing approach to the research community to facilitate the evaluation and development of LMs.

computational linguistic, relation, template, (15 more...)

arXiv.org Artificial Intelligence

2404.04113

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Europe > France (0.05)
Africa > Uganda > Central Region > Kampala (0.05)
(32 more...)

Genre: Research Report (1.00)

Industry: Leisure & Entertainment > Sports (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.96)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.49)

Add feedback

Program-Aided Reasoners (better) Know What They Know

Kabra, Anubha, Rangreji, Sanketh, Mathur, Yash, Madaan, Aman, Liu, Emmy, Neubig, Graham

arXiv.org Artificial IntelligenceNov-15-2023

Prior work shows that program-aided reasoning, in which large language models (LLMs) are combined with programs written in programming languages such as Python, can significantly improve accuracy on various reasoning tasks. However, while accuracy is essential, it is also important for such reasoners to "know what they know", which can be quantified through the calibration of the model. In this paper, we compare the calibration of Program Aided Language Models (PAL) and text-based Chain-of-thought (COT) prompting techniques over 5 datasets and 2 model types: LLaMA models and OpenAI models. Our results indicate that PAL leads to improved calibration in 75% of the instances. Our analysis uncovers that prompting styles that produce lesser diversity in generations also have more calibrated results, and thus we also experiment with inducing lower generation diversity using temperature scaling and find that for certain temperatures, PAL is not only more accurate but is also more calibrated than COT. Overall, we demonstrate that, in the majority of cases, program-aided reasoners better know what they know than text-based counterparts.

accuracy, calibration, llama model, (15 more...)

arXiv.org Artificial Intelligence

2311.09553

Country:

North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
North America > Canada (0.04)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.54)

Add feedback

Open-Set Knowledge-Based Visual Question Answering with Inference Paths

Gan, Jingru, Han, Xinzhe, Wang, Shuhui, Huang, Qingming

arXiv.org Artificial IntelligenceOct-12-2023

Given an image and an associated textual question, the purpose of Knowledge-Based Visual Question Answering (KB-VQA) is to provide a correct answer to the question with the aid of external knowledge bases. Prior KB-VQA models are usually formulated as a retriever-classifier framework, where a pre-trained retriever extracts textual or visual information from knowledge graphs and then makes a prediction among the candidates. Despite promising progress, there are two drawbacks with existing models. Firstly, modeling question-answering as multi-class classification limits the answer space to a preset corpus and lacks the ability of flexible reasoning. Secondly, the classifier merely consider "what is the answer" without "how to get the answer", which cannot ground the answer to explicit reasoning paths. In this paper, we confront the challenge of \emph{explainable open-set} KB-VQA, where the system is required to answer questions with entities at wild and retain an explainable reasoning path. To resolve the aforementioned issues, we propose a new retriever-ranker paradigm of KB-VQA, Graph pATH rankER (GATHER for brevity). Specifically, it contains graph constructing, pruning, and path-level ranking, which not only retrieves accurate answers but also provides inference paths that explain the reasoning process. To comprehensively evaluate our model, we reformulate the benchmark dataset OK-VQA with manually corrected entity-level annotations and release it as ConceptVQA. Extensive experiments on real-world questions demonstrate that our framework is not only able to perform open-set question answering across the whole knowledge base but provide explicit reasoning path.

graph, node, schema graph, (15 more...)

arXiv.org Artificial Intelligence

2310.08148

Country: Asia > British Indian Ocean Territory > Diego Garcia (0.04)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Expert Systems (1.00)
Information Technology > Artificial Intelligence > Natural Language > Question Answering (1.00)

Add feedback

Adaptive Prompt Learning with Distilled Connective Knowledge for Implicit Discourse Relation Recognition

Wang, Bang, Wang, Zhenglin, Xiang, Wei, Mo, Yijun

arXiv.org Artificial IntelligenceSep-14-2023

Implicit discourse relation recognition (IDRR) aims at recognizing the discourse relation between two text segments without an explicit connective. Recently, the prompt learning has just been applied to the IDRR task with great performance improvements over various neural network-based approaches. However, the discrete nature of the state-art-of-art prompting approach requires manual design of templates and answers, a big hurdle for its practical applications. In this paper, we propose a continuous version of prompt learning together with connective knowledge distillation, called AdaptPrompt, to reduce manual design efforts via continuous prompting while further improving performance via knowledge transfer. In particular, we design and train a few virtual tokens to form continuous templates and automatically select the most suitable one by gradient search in the embedding space. We also design an answer-relation mapping rule to generate a few virtual answers as the answer space. Furthermore, we notice the importance of annotated connectives in the training dataset and design a teacher-student architecture for knowledge transfer. Experiments on the up-to-date PDTB Corpus V3.0 validate our design objectives in terms of the better relation recognition performance over the state-of-the-art competitors.

answer space, contingency, knowledge, (17 more...)

arXiv.org Artificial Intelligence

2309.07561

Country:

Asia > China > Hubei Province > Wuhan (0.04)
North America > United States > Pennsylvania (0.04)
North America > United States > California > San Diego County > San Diego (0.04)

Genre: Research Report (0.82)

Industry: Education (0.71)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Add feedback